arXiv:1507.05780v2 [stat.CO] 29 Jul 2015 


Geometric ergodicity of the Random Walk Metropolis 
with position-dependent proposal covariance. 


Samuel Livingstone 

Department of Statistical Science, University College London, 
Gower Street, London WC1E 6BT, United Kingdom. 


Abstract 

We consider a Metropolis-Hastings method with proposal kernel
$\mathcal{N}(x, hG^{-1}(x))$, where $x$ is the current state. After discussing
specific cases from the literature, we analyse the ergodicity properties of the
resulting Markov chains. In one dimension we find that suitable choice of
$G^{-1}(x)$ can change the ergodicity properties compared to the Random Walk
Metropolis case $\mathcal{N}(x, h\Sigma)$, either for the better or worse. In
higher dimensions we use a specific example to show that judicious choice of
$G^{-1}(x)$ can produce a chain which will converge at a geometric rate to its
limiting distribution when probability concentrates on an ever narrower ridge
as $|x|$ grows, something which is not true for the Random Walk Metropolis.

Keywords: Monte Carlo, MCMC, Markov chains, Computational Statistics, 
Bayesian Inference. 


1. Introduction 

Markov chain Monte Carlo (MCMC) methods are techniques for estimating
expectations with respect to some distribution $\pi(\cdot)$, which need not be
normalised. This is done by sampling a Markov chain which has limiting
distribution $\pi(\cdot)$, and computing empirical averages. A popular form of
MCMC is the Metropolis-Hastings algorithm [?], where at each time step a
‘proposed’ move is drawn from some candidate distribution, and then accepted
with some probability, otherwise the chain stays at the current point. Interest
lies in finding choices of candidate distribution that will produce sensible
estimators for expectations with respect to $\pi(\cdot)$.

The quality of these estimators can be assessed in many different ways, but a
common approach is to understand conditions on $\pi(\cdot)$ that will result in
a chain which converges to its limiting distribution at a geometric rate. If
such a rate can be established, then a Central Limit Theorem will exist for
expectations of functionals with finite second absolute moment under
$\pi(\cdot)$ if the chain is reversible.^1

A simple yet often effective choice is a symmetric candidate distribution
centred at the current point in the chain (with a fixed variance), resulting in
the Random Walk Metropolis (RWM) (e.g. [3]). The convergence properties of
a chain produced by the RWM are well-studied. In one dimension, essentially,
convergence is geometric if $\pi(x)$ decays at an exponential or faster rate in
the tails [4], while in higher dimensions an additional curvature condition is
required [5]. Slower rates of convergence have also been established in the
case of heavier tails [?].

Recently, some MCMC methods have been proposed which generalise the
RWM, whereby proposals are still centred at the current point $x$ and symmetric,
but the variance changes with $x$ [?]. An extension to infinite-dimensional
Hilbert spaces is also suggested in [12]. The motivation is that
the chain can become more ‘local’, perhaps making larger jumps when out
in the tails, or mimicking the local dependence structure of $\pi(\cdot)$ to
propose more intelligent moves. Designing MCMC methods of this nature is
particularly relevant for modern Bayesian inference problems, where posterior
distributions are often high dimensional and exhibit nonlinear correlations
[13]. We term this approach the Position-Dependent Random Walk Metropolis
(PDRWM), although technically this is a misnomer, since proposals are no longer
random walks.^2 Other choices of candidate distribution designed for
distributions that exhibit nonlinear correlations were introduced in [13].
Although powerful, these require derivative information for $\log \pi(x)$,
something which can be unavailable in modern inference problems (e.g. [11]).
We note that no such information is required for the PDRWM, as evidenced by the
particular cases suggested in [?]. However, there are relations between the
approaches, to the extent that understanding how the properties of the PDRWM
differ from the standard RWM should also aid understanding of the methods
introduced in [13].

In this article we consider the convergence rate of a Markov chain generated
by the PDRWM to its limiting distribution. Our main interest lies in whether
this generalisation can change these ergodicity properties compared to the
standard RWM with fixed covariance. We focus on the case where the candidate
distribution is Gaussian, and in one dimension we establish necessary and
sufficient growth conditions on the proposal variance and tail behaviour of
$\pi(x)$ for geometric ergodicity. Some of the results extend naturally to
higher dimensions, but we also offer an illustrative example showing that the
curvature condition can be alleviated when the proposal covariance is allowed
to change with position. In Section 2 necessary concepts about Markov chains
are briefly reviewed, before the PDRWM is introduced in Section 3. One
dimensional results are given in Section 4, before those for higher dimensions
in Section 5 and a discussion in Section 6. Throughout, $\pi(\cdot)$ denotes a
probability distribution, and $\pi(x)$ its density with respect to Lebesgue
measure.

^1 We deal exclusively with reversible chains here; in the non-reversible case
the requirement is a finite $(2 + \delta)$th absolute moment.

^2 The size of jump now depends on the current position in the chain.

2. Markov Chains & Geometric Ergodicity 

We will work on the measurable space $(\mathcal{X}, \mathcal{B})$, so that each
$X_t \in \mathcal{X}$ for a discrete-time Markov chain $\{X_t\}_{t \geq 0}$ with
time-homogeneous transition kernel
$P : \mathcal{X} \times \mathcal{B} \to [0,1]$, where
$P(x, A) = \mathbb{P}[X_{i+1} \in A \mid X_i = x]$ and $P^n(x, A)$ is defined
similarly for $X_{i+n}$. All chains we consider will have invariant distribution
$\pi(\cdot)$, and be both $\pi$-irreducible and aperiodic, meaning $\pi(\cdot)$
is the limiting distribution from $\pi$-almost any starting point [?]. We use
$|\cdot|$ to denote the Euclidean norm.

In Markov chain Monte Carlo the objective is to construct estimators of
$\mathbb{E}_\pi[f]$, for some $f : \mathcal{X} \to \mathbb{R}$, by computing

$$\hat{f}_n = \frac{1}{n} \sum_{i=1}^{n} f(X_i), \qquad X_i \sim P^i(x_0, \cdot).$$

If $\pi(\cdot)$ is the limiting distribution for the chain then $P$ will be
ergodic, meaning $\hat{f}_n \to \mathbb{E}_\pi[f]$ almost surely from
$\pi$-almost any starting point. For finite $n$ the quality of $\hat{f}_n$
intuitively depends on how quickly $P^n(x, \cdot)$ approaches $\pi(\cdot)$. We
call the chain geometrically ergodic if

$$\| P^n(x, \cdot) - \pi(\cdot) \|_{TV} \leq M(x) \rho^n, \tag{1}$$

from $\pi$-almost any $x \in \mathcal{X}$, for some $M > 0$ and $\rho < 1$,
where $\| \mu(\cdot) - \nu(\cdot) \|_{TV} := \sup_{A \in \mathcal{B}} |\mu(A) - \nu(A)|$
is the total variation distance between distributions $\mu(\cdot)$ and
$\nu(\cdot)$ [15].

Geometric ergodicity implies that if $\mathbb{E}_\pi[|f|^{2+\delta}] < \infty$
for some $\delta > 0$, then

$$\sqrt{n} \left( \hat{f}_n - \mathbb{E}_\pi[f] \right) \xrightarrow{d} \mathcal{N}(0, v(P, f)), \tag{2}$$

for some asymptotic variance $v(P, f)$. Equation (2) enables the construction
of asymptotic confidence intervals for $\hat{f}_n$ [15]. Several techniques now
exist for constructing non-asymptotic confidence intervals (e.g. [?]), but
at present it is not yet clear whether these can be applied in the same sort
of generality as (2). In some cases, such approaches rely on either geometric
ergodicity or the equivalent^3 condition of a spectral gap existing for $P$
[19], so (1) must also be established for many of these non-asymptotic results
to hold (e.g. [?]). Geometric ergodicity is also often a requirement in
establishing the stability of noisy Markov chains in which $P$ is approximated
due to either intractability or computational convenience [20, 21] (in other
instances slightly weaker but related conditions are needed [?]).


^3 This is true for reversible chains.

In practice, geometric ergodicity does not guarantee that $\hat{f}_n$ will be a
sensible estimator, as $M(x)$ can be arbitrarily large if the chain is
initialised far from the typical set under $\pi(\cdot)$, and $\rho$ may be very
close to 1. However, chains which are not geometrically ergodic can often
either get ‘stuck’ for a long time in low-probability regions or fail to
explore the entire distribution adequately, sometimes in ways which are
difficult to diagnose using standard MCMC diagnostics.


2.1. Establishing geometric ergodicity 

It is shown in Chapter 15 of [23] that (1) is equivalent to the condition that
there exists a Lyapunov function $V : \mathcal{X} \to [1, \infty)$ and some
$\lambda < 1$, $b < \infty$ such that

$$PV(x) \leq \lambda V(x) + b \mathbb{1}_C(x), \tag{3}$$

where $PV(x) := \int V(y) P(x, dy)$. The set $C \subset \mathcal{X}$ must be
small, meaning that for some $m \in \mathbb{N}$, $\epsilon > 0$ and probability
measure $\nu(\cdot)$

$$P^m(x, A) \geq \epsilon \nu(A), \tag{4}$$

for any $x \in C$ and $A \in \mathcal{B}$. Equations (3) and (4) are referred
to as drift and minorisation conditions. Intuitively, $C$ can be thought of as
the centre of the space, and (3) ensures that some one dimensional projection
of $\{X_t\}_{t \geq 0}$ drifts towards $C$ at a geometric rate when outside. In
fact, (3) is sufficient for the return time distribution to $C$ to have
geometric tails [23]. Once in $C$, (4) ensures that with some probability the
chain forgets its past and hence regenerates. This regeneration allows the
chain to ‘couple’ with another started at stationarity, giving a bound on the
total variation distance through the coupling inequality [15]. More intuition
is given in [24].

Transition kernels considered here will be of the Metropolis-Hastings type,
given by

$$P(x, dy) = \alpha(x, y) Q(x, dy) + r(x) \delta_x(dy), \tag{5}$$

where $Q(x, dy) = q(y|x) dy$ is some candidate kernel, $\alpha$ is the
‘acceptance rate’ and $r(x) = 1 - \int \alpha(x, y) Q(x, dy)$. Here we choose

$$\alpha(x, y) = 1 \wedge \frac{\pi(y) q(x|y)}{\pi(x) q(y|x)}, \tag{6}$$

where $a \wedge b$ denotes the minimum of $a$ and $b$. This choice implies that
$P$ satisfies detailed balance for $\pi(\cdot)$ [25], and hence the chain is
reversible (note that other choices for $\alpha$ can result in non-reversible
chains, see [15] for details). In this case (2) applies to a slightly broader
class of functionals, namely those with $\mathbb{E}_\pi[f^2] < \infty$ [15].
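As an illustration of the kernel (5) with acceptance rate (6), the following is a minimal sketch of a single Metropolis-Hastings transition. The function name `mh_step` and its arguments (`log_pi`, `sample_q`, `log_q`) are hypothetical and chosen only for this example; working on the log scale is a common numerical convenience rather than something prescribed by the text.

```python
import numpy as np

def mh_step(x, log_pi, sample_q, log_q, rng):
    """One Metropolis-Hastings transition from state x.

    log_pi(x):        log of the (possibly unnormalised) target density.
    sample_q(x, rng): draws a proposal y from Q(x, .).
    log_q(y, x):      log of the candidate density q(y | x), up to a constant.
    Returns the next state and whether the proposal was accepted.
    """
    y = sample_q(x, rng)
    # log of pi(y) q(x|y) / (pi(x) q(y|x)), the ratio appearing in (6)
    log_ratio = log_pi(y) + log_q(x, y) - log_pi(x) - log_q(y, x)
    # accept with probability 1 ^ exp(log_ratio); otherwise stay at x
    if np.log(rng.uniform()) < min(0.0, log_ratio):
        return y, True
    return x, False
```

With a symmetric candidate such as `sample_q = lambda x, rng: x + rng.normal()`, the two `log_q` terms cancel and the step reduces to the Random Walk Metropolis.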

Roberts & Tweedie [5], following on from [?], introduced the following
regularity conditions.

Theorem 1. (Roberts & Tweedie). Suppose that $\pi(x)$ is bounded away from 0
and $\infty$ on compact sets, and there exists $\delta_q > 0$ and
$\epsilon_q > 0$ such that, for every $x$,

$$|x - y| \leq \delta_q \Rightarrow q(y|x) \geq \epsilon_q.$$

Then the chain with kernel (5) is $\pi$-irreducible and aperiodic, and every
nonempty compact set is small.

For the choices of $Q$ considered in this article these conditions hold, and we
will restrict ourselves to forms of $\pi(x)$ for which the same is true (apart
from a specific case in Section 5). Under Theorem 1, (1) only holds if a
Lyapunov function $V : \mathcal{X} \to [1, \infty]$ with
$\mathbb{E}_\pi[V] < \infty$ exists such that

$$\limsup_{|x| \to \infty} \frac{PV(x)}{V(x)} < 1. \tag{7}$$

When $P$ is of the Metropolis-Hastings type, (7) can be written

$$\limsup_{|x| \to \infty} \int \left[ \frac{V(y)}{V(x)} - 1 \right] \alpha(x, y) Q(x, dy) < 0. \tag{8}$$

In this case a simple criterion for lack of geometric ergodicity is

$$\limsup_{|x| \to \infty} r(x) = 1. \tag{9}$$

Intuitively this implies that the chain is likely to get ‘stuck’ in the tails
of a distribution for large periods.
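To make the drift criterion more concrete, the sketch below estimates $PV(x)/V(x)$ by Monte Carlo for a one-dimensional Gaussian Random Walk Metropolis, using the identity $PV(x)/V(x) = 1 + \int [V(y)/V(x) - 1]\alpha(x,y)Q(x,dy)$ that links (7) and (8). The function name `drift_ratio`, the Lyapunov function $V(x) = e^{s|x|}$ and all default parameter values are assumptions made for illustration only. A ratio above one for this particular $V$ does not by itself prove lack of geometric ergodicity.

```python
import numpy as np

def drift_ratio(x, log_pi, h=1.0, s=0.5, n=200_000, seed=0):
    """Monte Carlo estimate of PV(x)/V(x) for a 1-d RWM with V(x) = exp(s|x|).

    log_pi must accept NumPy arrays. Proposals are y = x + sqrt(h) * xi with
    xi standard normal, so the candidate kernel is symmetric.
    """
    rng = np.random.default_rng(seed)
    y = x + np.sqrt(h) * rng.standard_normal(n)
    # RWM acceptance probability 1 ^ pi(y)/pi(x)
    alpha = np.exp(np.minimum(0.0, log_pi(y) - log_pi(x)))
    v_ratio = np.exp(s * (np.abs(y) - abs(x)))
    # PV(x)/V(x) = 1 + E[(V(y)/V(x) - 1) * alpha(x, y)]
    return 1.0 + np.mean((v_ratio - 1.0) * alpha)
```

For a standard Gaussian target the estimate far in the tail falls below one, whereas for a Cauchy-like target (`log_pi = lambda z: -np.log1p(z**2)`) it stays above one with this choice of $V$.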

Jarner & Tweedie [?] introduce a necessary condition for geometric ergodicity
through a tightness condition.

Theorem 2. (Jarner & Tweedie). If for any $\epsilon > 0$ there is a
$\delta > 0$ such that for all $x \in \mathcal{X}$

$$P(x, B_\delta(x)) > 1 - \epsilon,$$

where $B_\delta(x) := \{y \in \mathcal{X} : d(x, y) < \delta\}$, then $P$ can
only be geometrically ergodic if for some $s > 0$

$$\int e^{s|x|} \pi(dx) < \infty.$$

The result highlights that when $\pi(\cdot)$ is heavy-tailed the chain must be
able to make very large moves and still be capable of returning to the centre
quickly for (1) to hold. In the Metropolis-Hastings case it is straightforward
to see that

$$Q(x, B_\delta(x)) > 1 - \epsilon \Rightarrow P(x, B_\delta(x)) > 1 - \epsilon,$$

which is a useful approach to establishing that (1) fails in the heavy-tailed
case.

3. Position-dependent Random Walk Metropolis 

In the RWM, $Q(x, dy) = q(|y - x|) dy$, meaning (6) reduces to
$\alpha(x, y) = 1 \wedge \pi(y)/\pi(x)$. A common choice is
$Q(x, \cdot) = \mathcal{N}(x, h\Sigma)$, with $\Sigma$ chosen to mimic the
global covariance structure of $\pi(\cdot)$ [3]. Various results exist
concerning the optimal choice of $h$ in a given setting (e.g. [?]). It is
straightforward to see that Theorem 2 holds here, so that the tails of $\pi(x)$
must be uniformly exponential or lighter for geometric ergodicity. In one
dimension this is in fact a sufficient condition [4], while for higher
dimensions additional conditions are required [5]. We return to this case in
Section 5.

In the PDRWM $Q(x, \cdot) = \mathcal{N}(x, hG^{-1}(x))$, so (6) becomes

$$\alpha(x, y) = 1 \wedge \frac{\pi(y) |G(y)|^{1/2}}{\pi(x) |G(x)|^{1/2}} \exp\left( -\frac{1}{2h} (x - y)^T [G(y) - G(x)] (x - y) \right).$$

The intuition here is that proposals are more able to reflect the local
dependence structure of $\pi(\cdot)$. In some cases this dependence may vary
greatly in different parts of the state-space, making a global choice of
$\Sigma$ ineffective [?].
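A single PDRWM transition can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's implementation: the function name `pdrwm_step` and the callable `G` (returning the symmetric positive-definite matrix $G(x)$) are hypothetical. The acceptance ratio follows from (6) with Gaussian candidate density $q(y|x) \propto |G(x)|^{1/2} \exp\{-(y-x)^T G(x) (y-x)/(2h)\}$.

```python
import numpy as np

def pdrwm_step(x, log_pi, G, h, rng):
    """One PDRWM transition in d dimensions.

    x:      current state, 1-d array of length d.
    log_pi: log target density (up to a constant).
    G:      callable returning the d x d positive-definite matrix G(x).
    h:      step-size.
    """
    Gx = G(x)
    # Propose y ~ N(x, h * G(x)^{-1}) via a Cholesky factor of G(x)^{-1}
    L = np.linalg.cholesky(np.linalg.inv(Gx))
    y = x + np.sqrt(h) * L @ rng.standard_normal(x.shape[0])
    Gy = G(y)
    d = x - y
    _, logdet_x = np.linalg.slogdet(Gx)
    _, logdet_y = np.linalg.slogdet(Gy)
    # log of pi(y) q(x|y) / (pi(x) q(y|x)) for the Gaussian candidate
    log_alpha = (log_pi(y) - log_pi(x)
                 + 0.5 * (logdet_y - logdet_x)
                 - d @ (Gy - Gx) @ d / (2.0 * h))
    if np.log(rng.uniform()) < min(0.0, log_alpha):
        return y, True
    return x, False
```

When `G` returns a constant matrix the determinant and quadratic-form corrections vanish and the step coincides with the fixed-covariance RWM.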

Readers familiar with differential geometry will recognise the volume element
$|G(x)|^{1/2} dx$ and the linear approximations to the distance between $x$ and
$y$ taken at each point through $G(x)$ and $G(y)$ if $\mathcal{X}$ is viewed as
a Riemannian manifold with metric $G$. We do not explore these observations
further here, but the interested reader is referred to [53] for more
discussion.

The choice of G{x) is an obvious question. In fact, specific variants of this 
method have appeared on many occasions in the literature, some of which we 
now summarise. 

1. Tempered Langevin diffusions [?]. $G^{-1}(x) = \pi^{-1}(x) I$. The authors
   highlight that the diffusion with dynamics
   $dX_t = \pi^{-1/2}(X_t) dW_t$ has invariant distribution $\pi(\cdot)$,
   motivating the choice. The method was shown to perform well for a bi-modal
   $\pi(x)$, as larger jumps are proposed in the low density region between
   the two modes.

2. State-dependent Metropolis [?]. $G^{-1}(x) = (1 + |x|)^b$. Here the
   intuition is simply that $b > 0$ means larger jumps will be made in the
   tails. In one dimension the authors compare the expected squared jumping
   distance $\mathbb{E}[(X_{i+1} - X_i)^2]$ empirically for chains exploring a
   $\mathcal{N}(0, 1)$ target distribution, choosing $b$ adaptively, and found
   $b \approx 1.6$ to be optimal.

3. Regional adaptive Metropolis-Hastings [?].
   $G^{-1}(x) = \sum_{i=1}^{m} \mathbb{1}(x \in \mathcal{X}_i) \Sigma_i$.
   In this case the state-space is partitioned into
   $\mathcal{X}_1 \cup \ldots \cup \mathcal{X}_m$, and a different proposal
   covariance $\Sigma_i$ is learned adaptively in each region
   $1 \leq i \leq m$. An extension which allows for some errors in choosing an
   appropriate partition is discussed in [?].

4. Localised Random Walk Metropolis [?].
   $G^{-1}(x) = \sum_k q_\theta(k|x) \Sigma_k$. Here $q_\theta(k|x)$ are
   weights based on approximating $\pi(x)$ with some mixture of
   Normal/Student’s t distributions, using the approach suggested in [?].
   At each iteration of the algorithm a mixture component $k$ is sampled
   from $q_\theta(\cdot|x)$, and the covariance $\Sigma_k$ is used for the
   proposal $Q(x, dy)$.

5. Kernel adaptive Metropolis-Hastings [9].
   $G^{-1}(x) = \gamma^2 I + \nu^2 M_x H M_x^T$, where
   $M_x = 2[\nabla_x k(z_1, x), \ldots, \nabla_x k(z_n, x)]$ for some kernel
   function $k$ and $n$ past samples $\{z_1, \ldots, z_n\}$,
   $H = I - (1/n) \mathbb{1}_{n \times n}$ is a centering matrix, and
   $\gamma$, $\nu$ are tuning parameters. The approach is based around
   performing nonlinear principal components analysis on past samples from the
   chain to learn a local covariance. Illustrative examples for the case of a
   Gaussian kernel show that $M_x H M_x^T$ acts as a weighted empirical
   covariance of samples $z$, with larger weights given to the $z_i$ which are
   closer to $x$ [9].

The latter cases also motivate any choice of the form

$$G^{-1}(x) = \sum_{i=1}^{n} w(x, z_i) (z_i - x)^T (z_i - x)$$

for some past samples $\{z_1, \ldots, z_n\}$ and weight function
$w : \mathcal{X} \times \mathcal{X} \to [0, \infty)$ with
$\sum_i w(x, z_i) = 1$ that decays as $|x - z_i|$ grows, which would also mimic
the local curvature of $\pi(\cdot)$ (taking care to appropriately regularise
and diminish adaptation so as to preserve ergodicity, as outlined in [?]). The
logic of [?] could also be applied, by choosing $G(x)$ as some regularised
version of the negative Hessian of $\log \pi(x)$. However, if such derivative
information were available it would seem more sensible to use a more
sophisticated method than a martingale proposal (see e.g. [?]).
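The weighted form above can be sketched numerically as follows. This is only an illustrative sketch: the function name `local_covariance`, the Gaussian weight function, the `bandwidth` parameter and the ridge term `reg` are all assumptions introduced here, and the adaptation safeguards mentioned in the text are not addressed. Past samples are stored as rows, so each $z_i - x$ is a row vector and $(z_i - x)^T (z_i - x)$ is a $d \times d$ matrix.

```python
import numpy as np

def local_covariance(x, Z, bandwidth=1.0, reg=1e-3):
    """Locally weighted covariance of past samples Z (n x d) around the point x.

    Weights decay with |x - z_i| through a Gaussian kernel (an assumed choice)
    and are normalised to sum to one. A small multiple of the identity is added
    as a simple regularisation so the result is positive definite.
    """
    diffs = Z - x                                  # row i is z_i - x
    sq_dist = np.sum(diffs**2, axis=1)
    # subtract the minimum before exponentiating for numerical stability
    w = np.exp(-0.5 * (sq_dist - sq_dist.min()) / bandwidth**2)
    w /= w.sum()
    cov = (diffs * w[:, None]).T @ diffs           # sum_i w_i (z_i-x)^T (z_i-x)
    return cov + reg * np.eye(Z.shape[1])
```

The returned matrix could then play the role of $G^{-1}(x)$ in a position-dependent proposal.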


4. Results in One Dimension 


Here the specific choice of $G(x)$ is left open, and we instead consider two
different general scenarios as $|x| \to \infty$: i) $G^{-1}(x) \to \Sigma$, and
ii) $G^{-1}(x) \to \infty$ at some rate. In theory there is also the
possibility that $G^{-1}(x) \to 0$, though intuitively this would not seem to
be a particularly sensible choice as chains would be extremely likely to spend
a long time in the tails of a distribution, so we do not consider it.

Three scenarios are considered for the tail behaviour of $\pi(x)$. We refer to
this density as log-concave in the tails if for some $x_0 > 0$ and $a > 0$

$$\pi(y)/\pi(x) \leq \exp\{-a(y - x)\}, \qquad \forall\, y \geq x \geq x_0, \tag{10}$$

and a similar condition holds in the negative tail. If (10) is not satisfied
but there is some $\beta \in (0, 1)$ such that the above condition can be
replaced with $\pi(y)/\pi(x) \leq \exp\{-a(y^\beta - x^\beta)\}$, then we call
the density subexponential (note this is not the standard definition). Finally,
we call $\pi(x)$ ‘polynomial-tailed’ if $\pi(x) \propto |x|^{-p}$ for large
$|x|$ and some $p > 1$. We also apply asymptotic growth conditions for
$G^{-1}(x)$, and without loss of generality assume that these hold for any $x$
larger than the same $x_0$ in absolute value.

We introduce some asymptotic notation in this section. For positive real-valued
functions $f$ and $g$, let $f(x) = \Theta(g(x))$ imply
$f(x)/g(x) \to C > 0$ as $x \to \infty$, and $f(x) = \omega(g(x))$ imply
$f(x)/g(x) \to \infty$. The more familiar big-O and little-o notation is also
used. The main results of this section are summarised in Table 1 at the end of
the section.

The first result emphasises a growing variance as a necessary requirement
for geometric ergodicity in the heavy-tailed case.


Lemma 1. If $G^{-1}(x) \leq \sigma^2$, then the PDRWM can only produce a
geometrically ergodic Markov chain if $\pi(x)$ is log-concave in the tails.

Proof: In this case for any choice of $\epsilon > 0$ there is a $\delta > 0$
such that $Q(x, B_\delta(x)) > 1 - \epsilon$, so Theorem 2 can be applied. ■

Though the heavy-tailed case is a challenging scenario, the standard RWM
with fixed covariance will produce a geometrically ergodic Markov chain if
$\pi(x)$ is log-concave. Next we extend this result to the case of
sub-quadratic variance growth in the tails.


Lemma 2. If $G^{-1}(x) = o(|x|^2)$ and $\pi(x)$ is log-concave in the tails,
then the PDRWM method produces a geometrically ergodic Markov chain from
$\pi$-almost any starting point. If $\pi(x)$ is subexponential for some
$\beta \in (0, 1)$, then choosing $G^{-1}(x) = \Theta(|x|^\gamma)$ for some
$2(1 - \beta) < \gamma < 2$ gives the same result.

Proof: See Appendix A.1.


The log-concave proof consists of partitioning $\mathcal{X}$ into five regions,
and showing that as $|x| \to \infty$, (8) evaluated over each of these regions
will either become arbitrarily small or remain strictly negative. We use the
Lyapunov function $V(x) = e^{s|x|}$ for some $s > 0$. This choice allows
results about moment generating functions of truncated Gaussian distributions
(see Appendix B) to be used, in conjunction with simple bounds on the
cumulative distribution function from [22], to establish that (8) will become
arbitrarily small for regions of $\mathcal{X}$ outside the ‘typical set’
$(x - cx^{\gamma/2}, x + cx^{\gamma/2})$. Theorem 3.2 from [?] shows that for
the RWM with fixed covariance (8) evaluated over this region will be strictly
negative. The essence of the argument is that for $y > x$ in the tails,
$\alpha(x, y) \leq e^{-a(y - x)}$ by log-concavity, so as long as $s$ is chosen
to be less than $a$ this decay will dominate any growth in $V(y)$ here. As
$\alpha(x, y) = 1$ for any inwards proposals, it can be shown that (8) is
strictly negative when evaluated over this region.

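The crucial additional difficulty in the case of growing covariance is that the
acceptance rate in this region (for suitably large $x$) is now

$$\alpha(x, y) = 1 \wedge \frac{\pi(y)}{\pi(x)} \left( \frac{x}{y} \right)^{\gamma/2} \exp\left( -\frac{(x - y)^2}{2h} \left[ \frac{1}{y^\gamma} - \frac{1}{x^\gamma} \right] \right).$$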


The problematic term lies inside the square bracket: this will be negative for
$y > x$, meaning a large positive component in $\alpha(x, y)$. To deal with
this, we use a Taylor expansion of $y^{-\gamma}$ about $x$ and some
simplifications to show that provided $\gamma < 2$, for large enough $x$,
locally (for $y$ near $x$, where the choice of region plays a role) the
acceptance rate will still satisfy

$$\alpha(x, y) = 1 \text{ for } y < x, \qquad \alpha(x, y) \leq e^{-(a - \epsilon_x)(y - x)} \text{ for } y > x,$$

where $\epsilon_x$ can be made arbitrarily small. This allows us to use a
similar argument to that in [3] to prove the result. Outside of this region the
Gaussian tails of $Q(x, \cdot)$ take care of any less desirable behaviour of
$\alpha(x, y)$. To extend this result to the subexponential case, we choose
$V(x) = e^{s|x|^\beta}$ and Taylor expand $|y|^\beta$ in the typical set to get
a suitable bound on $\alpha(x, y)$.

Note that this lemma includes as a special case any instance in which
$G^{-1}(x) \uparrow \sigma^2$ as $|x| \to \infty$. However, the case
$G^{-1}(x) \to \sigma^2$ from any direction is actually more straightforward to
show, by simply moving $x$ far enough into the tails that $G^{-1}(y)$ is
arbitrarily close to $\sigma^2$ for all
$y \in (x - cx^{\gamma/2}, x + cx^{\gamma/2})$. In this case the argument in
[4] can be applied more straightforwardly.

Although we do not formally prove that the method will not produce a
geometrically ergodic chain in the polynomial tailed case when
$G^{-1}(x) = o(|x|^2)$, we show intuitively that this will be the case.
Assuming that in the tails $\pi(x) \propto |x|^{-p}$ for some $p > 1$ then for
large $x$

$$\alpha(x, x + cx^{\gamma/2}) = 1 \wedge \left( \frac{x}{x + cx^{\gamma/2}} \right)^{p + \gamma/2} \exp\left( \frac{c^2}{2h} \left[ 1 - \frac{x^\gamma}{(x + cx^{\gamma/2})^\gamma} \right] \right).$$

The first expression on the right hand side converges to 1 as $x \to \infty$,
which is akin to the case of fixed proposal covariance. The second term will be
larger than one for $c > 0$ and less than one for $c < 0$. So the algorithm
will exhibit the same ‘random walk in the tails’ behaviour which is often
characteristic of the RWM in this scenario, and so the acceptance rate will
fail to enforce a geometric drift back into the centre of the space.
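The behaviour of the two factors in this acceptance rate can be checked numerically. The sketch below assumes $\pi(x) \propto |x|^{-p}$ and $G^{-1}(x) = x^\gamma$ exactly for $x > 0$; the function name `accept_poly_tail` and the default values of `p`, `gamma` and `h` are hypothetical and chosen only for illustration.

```python
import numpy as np

def accept_poly_tail(x, c, p=2.0, gamma=1.0, h=1.0):
    """Acceptance rate for the move x -> x + c * x**(gamma/2), with x > 0.

    Assumes pi(x) proportional to x**(-p) and G^{-1}(x) = x**gamma.
    Returns (alpha, density_and_volume_factor, exponential_factor).
    """
    y = x + c * x**(gamma / 2.0)
    # (x / y)^(p + gamma/2): target ratio times the |G|^(1/2) ratio
    dens_term = (x / y)**(p + gamma / 2.0)
    # exp{ c^2/(2h) * [1 - x^gamma / y^gamma] }: the quadratic-form correction
    exp_term = np.exp(c**2 / (2.0 * h) * (1.0 - (x / y)**gamma))
    return min(1.0, dens_term * exp_term), dens_term, exp_term
```

For large `x` the first factor is close to one, while the exponential factor is above one for `c > 0` and below one for `c < 0`, in line with the discussion above.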

In the case where $\gamma = 2$ this will not happen, as the terms in the above
expression will be roughly constant with $x$. We examine this case next.


Lemma 3. If $G^{-1}(x) = \Theta(|x|^2)$, then there is an $h_0 > 0$ such that
for a step-size $h \in (0, h_0)$ the PDRWM method produces a geometrically
ergodic Markov chain from $\pi$-almost any starting point, provided
$\pi(x) \leq |x|^{-p}$ in the tails for some $p > 1$.

Proof: See Appendix A.2.

Here the intuition is that proposals in the tails will take the form
$y = (1 + \xi\sqrt{h})x$, which if $h$ is chosen to be small will be similar to
$y = e^{\xi\sqrt{h}}x$. The latter scheme is sometimes called the
multiplicative RWM, and is known to be geometrically ergodic in this scenario
(e.g. [?]), as this equates to taking a log-transformation of $x$, which
‘lightens’ the tails of the target density to the point where it becomes
log-concave.

In this case we take the Lyapunov function $V(x) = 1 \vee |x|^s$ with $s > 0$
chosen such that $\int V(y) \pi(dy) < \infty$. We again divide the integral (8)
into regions, but in this case we show that each of these can be appropriately
bounded simply as functions of the step-size $h$, i.e. independently of $x$. By
examining each term, we show that for a small enough $h$ the integral will be
strictly negative.

The result is positive, but in this case is perhaps an example where the
theory does not necessarily translate into an effective scheme in practice. If
$\pi(x)$ has particularly heavy tails, for example, then it is likely that an
extremely small value of $h$ would be needed to ensure this, meaning the
geometric rate of convergence $\rho$ would be close to one. Nonetheless, it is
an example of how appropriate choice of $G^{-1}(x)$ can favourably change the
ergodicity properties of a sampler.
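The intuition behind Lemma 3, that a proposal of the form $y = (1 + \xi\sqrt{h})x$ resembles the multiplicative proposal $y = e^{\xi\sqrt{h}}x$ when $h$ is small, can be checked with a few lines of code. The function name `tail_proposals` is hypothetical, and the comparison is purely illustrative of the first-order expansion $e^u \approx 1 + u$.

```python
import numpy as np

def tail_proposals(x, h, xi):
    """Return the linear-in-x and multiplicative proposals for the same noise xi."""
    linear = (1.0 + xi * np.sqrt(h)) * x          # y = (1 + xi * sqrt(h)) * x
    multiplicative = np.exp(xi * np.sqrt(h)) * x  # y = exp(xi * sqrt(h)) * x
    return linear, multiplicative

# Compare the two proposals over a grid of noise values for a small step-size.
xi = np.linspace(-2.0, 2.0, 9)
lin, mult = tail_proposals(100.0, 0.01, xi)
rel_diff = np.max(np.abs(lin - mult) / np.abs(mult))
```

The relative difference shrinks as `h` decreases, since the two maps agree to first order in $\xi\sqrt{h}$.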

The final result of this section provides a note of warning, that lack of care
in choosing $G^{-1}(x)$ can have severe consequences for the method.


Lemma 4. If $G^{-1}(x) = \omega(|x|^2)$, then the PDRWM method can never
produce a geometrically ergodic Markov chain provided $\pi(x) \to 0$ as
$|x| \to \infty$.

Proof: See Appendix A.3.

The intuition for this result is straightforward when explained. In the tails,
the average proposals will be of size of order $|x|^{\gamma/2}$, which will be
much larger than $|x|$ if $\gamma > 2$, meaning most will send the chain even
further into the tails in either direction (and hence will likely be rejected).
To make this rigorous we show that (9) holds here, by considering the set of
proposals $A_{x,\epsilon} := \{y \in \mathcal{X} : \alpha(x, y) > \epsilon\}$,
and showing that $Q(x, A_{x,\epsilon}) \to 0$ as $|x| \to \infty$, for any
$\epsilon > 0$. A specific example is illustrated in Figure 1.
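The shrinking probability of proposing into the acceptable set can be illustrated by simulation. The sketch below is an assumption-laden example rather than a reproduction of Figure 1: it takes a one-dimensional target $\pi(x) \propto e^{-|x|}$ and $G^{-1}(x) = |x|^\gamma$ with $\gamma = 3$, both chosen here for illustration, and the function name `frac_acceptable` is hypothetical.

```python
import numpy as np

def frac_acceptable(x, eps=0.5, gamma=3.0, h=1.0, n=100_000, seed=0):
    """Monte Carlo estimate of Q(x, {y : alpha(x, y) > eps}) in one dimension.

    Assumed target: pi(z) proportional to exp(-|z|).
    Assumed metric: G(z) = |z|**(-gamma), i.e. proposal variance h * |z|**gamma.
    """
    rng = np.random.default_rng(seed)
    y = x + np.sqrt(h * abs(x)**gamma) * rng.standard_normal(n)
    G_x = abs(x)**(-gamma)
    G_y = np.abs(y)**(-gamma)
    # log acceptance ratio: target ratio, |G|^(1/2) ratio, quadratic correction
    log_alpha = (-(np.abs(y) - abs(x))
                 + 0.5 * (np.log(G_y) - np.log(G_x))
                 - (x - y)**2 * (G_y - G_x) / (2.0 * h))
    return np.mean(log_alpha > np.log(eps))
```

Evaluating this at, say, `x = 10`, `100` and `1000` gives a decreasing sequence, consistent with most proposals overshooting the centre of the space as $|x|$ grows.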









Figure 1: Example with Tr(x) oc e“bl^ G~^{x) oc |3;|^. The black triangle denotes the current 
state, points highlighted in blue represent proposals with $\alpha(x, y) > 0.5$,
with all others highlighted in red. For large $|x|$ the majority of proposals
miss the centre of the space and are rejected.


The main results of this section are summarised in Table 1. 

Variance                       Polynomial Tails   Subexponential   Log-concave
$G^{-1}(x) = o(|x|^2)$         ×                  ✓+               ✓
$G^{-1}(x) = \Theta(|x|^2)$    ✓*                 ✓*               ✓*
$G^{-1}(x) = \omega(|x|^2)$    ×                  ×                ×

Table 1: Summary of one dimensional results. Here $f(x) = \omega(g(x))$ means
$f/g \to \infty$ as $x \to \infty$, $f(x) = \Theta(g(x))$ means
$f/g \to C > 0$, ✓ means geometrically ergodic, ✓+ means geometrically ergodic
provided $G^{-1}(x) = \Theta(|x|^\gamma)$ for some $2 > \gamma > 2(1 - \beta)$,
and ✓* means geometrically ergodic provided $h$ is suitably small.


5. Higher Dimensions 

Some results from the previous section naturally carry over to higher
dimensions. The most straightforward is outlined below.


Lemma 5. If each element of $G^{-1}(x)$ is bounded above (uniformly in $x$),
then the PDRWM can only produce a geometrically ergodic Markov chain if the
tails of $\pi(x)$ are uniformly exponential or lighter.

Proof: As with Lemma 1, a straightforward application of Theorem 2 gives the
result. ■

It is also intuitive that an analogue to Lemma 4 will exist here. Specifically,
if any diagonal component of the covariance $G^{-1}(x)$ grows at a faster than
quadratic rate with $x$, then the sampler is likely to run into the same
difficulties in the tails. Similarly, when $G^{-1}(x) \to \Sigma$, it is
straightforward to see that the sampler will inherit the geometric ergodicity
properties of the RWM with fixed covariance, by a similar argument to that
discussed for the proof of Lemma 2 in this case.

As mentioned earlier, in the case $G^{-1}(x) = \Sigma$, additional conditions
on $\pi(x)$ are required for geometric ergodicity in more than one dimension,
outlined in [5]. An example is also given in the paper of the simple
two-dimensional density $\pi(x, y) \propto \exp(-x^2 - y^2 - x^2 y^2)$, which
fails to meet this criterion. The difficult models are those for which
probability concentrates on a ‘ridge’ in the tails, which becomes ever narrower
as $|x|$ increases. In this instance, proposals from the RWM are less and less
likely to be accepted as $|x|$ grows. The problem is illustrated graphically in
Figure 2. Such densities are often encountered as posterior distributions in
hierarchical models, with another well-known example being the ‘funnel’,
discussed in [33]. On the same figure there is some graphical evidence that if
the proposal covariance is allowed to adjust then this problem can be
alleviated somewhat.
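The falling acceptance rate along the ridge can be estimated by simulation. The following sketch evaluates a spherical Gaussian RWM proposal at the point $(t, 0)$ for the density $\pi(x, y) \propto \exp(-x^2 - y^2 - x^2 y^2)$; the function name `rwm_accept_rate`, the choice of evaluation points and the default step-size are assumptions made for illustration.

```python
import numpy as np

def rwm_accept_rate(t, h=1.0, n=100_000, seed=0):
    """Estimated RWM acceptance probability at the point (t, 0).

    Target: pi(u, v) proportional to exp(-u^2 - v^2 - u^2 v^2).
    Proposal: N((t, 0), h * I), i.e. a spherical Gaussian random walk.
    """
    rng = np.random.default_rng(seed)
    log_pi = lambda u, v: -(u**2 + v**2 + u**2 * v**2)
    prop = np.array([t, 0.0]) + np.sqrt(h) * rng.standard_normal((n, 2))
    log_alpha = log_pi(prop[:, 0], prop[:, 1]) - log_pi(t, 0.0)
    # average of 1 ^ pi(proposal)/pi(current) over the simulated proposals
    return np.mean(np.exp(np.minimum(0.0, log_alpha)))
```

Comparing, for instance, `rwm_accept_rate(1.0)`, `rwm_accept_rate(10.0)` and `rwm_accept_rate(100.0)` shows the estimate decreasing as the chain moves along the ridge.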

Figure 2: Contours of the density $\pi(x, y) \propto \exp(-x^2 - y^2 - x^2 y^2)$.
The left-hand plots show that a RWM with spherical covariance will find it
increasingly difficult to propose values which will be accepted as the chain
moves into the tails. The right-hand plots suggest that allowing the covariance
to change with position might alleviate this issue.

To explore this more concretely, we design an extremely simple two dimensional
density which exhibits the same features, which we call the ‘rectangle’
density

□ (x) cx R:={yG > 1 , \yi\< 

where int($z$) is the integer part of $z \in \mathbb{R}$. This is simply a
distribution defined over a sequence of rectangles on the upper-half plane of
$\mathbb{R}^2$ (starting at $y_2 = 1$), each centred on the vertical axis, with
height one and with each successive rectangle a third of the width and depth of
the previous. Intuitively, the density is an ever narrowing staircase, as shown
in Figure 3.

For simplicity here we take the Random Walk Metropolis proposal as simply a
uniform distribution on the circle of radius one about the current point, so
$Q_R(x, A) = |A \cap B_{x,1}| / |B_{x,1}|$, where
$B_{x,1} := \{y \in \mathbb{R}^2 : |y - x| \leq 1\}$. To imitate the changing
covariance in the PDRWM, we take as a proposal a uniform distribution over an
ellipse $E_x$ whose width depends on the current position
$x = (x_1, x_2) \in \mathbb{R}^2$, so
$Q_P(x, A) = |A \cap E_x| / |E_x|$. For these choices many of the calculations
required in this section reduce to calculating areas of rectangles and
ellipses.

The rectangle density does not satisfy the conditions of Theorem 1, as
$\square(x)$ is not bounded away from zero on compact sets; however, any small
set here must still be compact for both methods specified. To see this, note
that for any fixed $m < \infty$, $\mathrm{supp}\{P_R^m(x, \cdot)\}$ is compact,
so that for a minorisation condition of the form (4) to hold within some small
set $C$, we must have that
$\mathrm{supp}\{\nu(\cdot)\} \subset \mathrm{supp}\{P_R^m(x, \cdot)\} \cap \mathrm{supp}\{P_R^m(y, \cdot)\}$
for every $x, y \in C$. As this intersection will only be non-empty for bounded
$|x - y|$, $C$ must be compact. The same argument holds for the elliptical
case. Because of this, establishing (9) is still sufficient to characterise
lack of geometric ergodicity.


Lemma 6. The Metropolis-Hastings algorithm with proposal $Q_R$ does not produce
a geometrically ergodic Markov chain when $\pi(x) = \square(x)$.


Proof: It is sufficient to construct a sequence of points
$x_p \in \mathbb{R}^2$ such that $|x_p| \to \infty$ as $p \to \infty$, and show
that $r(x_p) \to 1$. Take $x_p = (0, p)$ for $p \in \mathbb{N}$. In this case
$r(x_p)$ is bounded below by one minus the area of the rectangles that $x_p$ is
on the boundary of divided by the area of the circle $|B_{x,1}| = \pi$. This
lower bound is of the form $1 - c\,3^{-(p-1)}$ for a constant $c$, which tends
to 1 as $p \to \infty$, as required. ■

The approach makes it clear that reducing the area of an ellipse at the same
rate as the area of the rectangles will remove this issue. The next result
confirms this intuition.

Figure 3: The rectangle density.


Lemma 7. The Metropolis-Hastings algorithm with proposal $Q_P$ produces a
geometrically ergodic Markov chain when $\pi(x) = \square(x)$, from
$\pi$-almost any starting point.


Proof: We can take as a small set $C = \{y \in R : 1 \leq y_2 \leq 2\}$, i.e.
the largest rectangle on the contour plot, and the Lyapunov function
$V(x) = |x_2| + 1 \vee |x_1|$. For $x, y \in R$, $V(y) \leq V(x)$ iff
$y_2 \leq x_2$. Note also that $\alpha(x, y) = 1$ for any
$x, y \in R \cap \{y \in \mathcal{X} : y_2 \leq x_2\}$. It suffices, with these
choices, to show that the overlap on the contour plot between the lower
hemisphere of each $E_x$ and $R$ is larger than that between $R$ and the upper
hemisphere for any $x \in R \setminus C$, which is clearly true from inspecting
the figures in Appendix C.

6. Discussion 


In this paper we have analysed the ergodic behaviour of a Metropolis-Hastings
method with proposal kernel $Q(x, \cdot) = \mathcal{N}(x, hG^{-1}(x))$. In one
dimension we have characterised the behaviour in terms of growth conditions on
$G^{-1}(x)$ and tail conditions on the target distribution, and some cases in
higher dimensions have also been discussed. The fundamental question of
interest was whether generalising an existing Metropolis-Hastings method by
allowing the proposal covariance to change with position can alter the
ergodicity properties of the sampler. We can confirm that this is indeed
possible, either for the better or worse, depending on the choice of
covariance. The take home points for practitioners are i) lack of sufficient
care in the design of $G^{-1}(x)$ can have severe consequences (as in
Lemma 4), and ii) careful choice of $G^{-1}(x)$ can have much more beneficial
ones, particularly in higher dimensions, as evidenced by the ‘rectangle’
density example in Section 5.

We feel that such results can also offer insight into similar generalisations of
different Metropolis-Hastings algorithms (e.g. [13, 34]). For example, it seems
intuitive that any method in which the variance grows at a faster than quadratic
rate in the tails is unlikely to produce a geometrically ergodic chain. There
are connections between the PDRWM and some extensions of the
Metropolis-adjusted Langevin algorithm, the ergodicity properties of which are
discussed in [35]. The key difference between the schemes is the inclusion of
the drift term $G^{-1}(x)\nabla \log \pi(x)/2$ in the latter. It is this term which in the
main governs the behaviour of the sampler, which is why the behaviour of the
PDRWM is different to this scheme (note that gradients are required for all
variants, unlike in the PDRWM).

We can apply the general results to the specific variants discussed in
Section 3. Provided sensible choices of regions/weights, and diminishing
adaptation schemes are chosen, the Regional adaptive Metropolis-Hastings, Locally
weighted Metropolis and Kernel-adaptive Metropolis-Hastings samplers should
all satisfy $G^{-1}(x) \to \Sigma$ as $|x| \to \infty$, meaning they will inherit the ergodicity
properties of the standard RWM (the behaviour in the centre of the space,
however, will likely be different). In the State-dependent Metropolis method,
provided $b \le 2$ (with suitable tuning in the equality case), the sampler should
also behave reasonably. Whether or not a large enough value of $b$ would be
found by a particular adaptation rule in the subexponential case is not entirely
clear, and this could be an interesting direction of further study. The Tempered
Langevin diffusion scheme, however, will fail to produce a geometrically ergodic
Markov chain whenever the tails of $\pi(x)$ are lighter than those of a Cauchy
distribution. In the case of Gaussian tails, for example, $G^{-1}(x) =$ To allow
reasonable tail exploration, two pragmatic options would be to upper bound
$G^{-1}(x)$ manually or use this scheme in conjunction with another, as there is
evidence that the sampler can perform favourably when exploring the centre
of a distribution. None of the specific variants discussed here are able to
mimic the local curvature of $\pi(x)$ in the tails, so as to enjoy the favourable
behaviour exemplified in Lemma 7. This is possible using Hessian information,
though where this isn't available it should also be possible, at least in some
cases, using appropriate surrogates.

It is reasonable to ask whether exploring the tails of a distribution adequately
is always necessary. If the functions a practitioner is interested in estimating are
such that $\int f(x)\pi(dx) \approx \int f(x)\pi_C(dx)$, where $\pi_C(\cdot)$ is the target restricted to the
centre of the space $C$, then perhaps this is not so important. Some results in this
direction are available. If this approach is taken, however, whether or not
a sampler will perform appropriately becomes a considerably more
problem-dependent question. Geometric ergodicity, whilst by no means guaranteeing
sensible estimators in the non-asymptotic context, does give steps towards this
in some generality. As mentioned earlier, it also appears to have
other favourable consequences. As such, we feel it is a property worth
establishing.
Appendix A. Proofs


Appendix A.1. Proof of Lemma 2

For the log-concave case, take $V(x) = e^{s|x|}$ for some $s > 0$, and let $B_A$ denote
the integral $\int_A [V(y)/V(x) - 1]\,\alpha(x,y)Q(x,dy)$ over the set $A$. We first break up $X$ into
$(-\infty, 0] \cup (0, x - cx^{\gamma/2}] \cup (x - cx^{\gamma/2}, x + cx^{\gamma/2}] \cup (x + cx^{\gamma/2}, x + cx^{\gamma}] \cup (x + cx^{\gamma}, \infty)$,
and show that the integral is
strictly negative on at least one of these sets, and can be made arbitrarily small
as $x \to \infty$ on all others. The $x \to -\infty$ case is analogous from the tail conditions on
$\pi(x)$.

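On $(-\infty, 0]$, we have

$$B_{(-\infty,0]} = e^{-sx}\int_{-\infty}^{0} e^{s|y|}\alpha(x,y)Q(x,dy) - \int_{-\infty}^{0}\alpha(x,y)Q(x,dy) \le e^{-sx}\int_{0}^{\infty} e^{sy}Q(-x,dy).$$

The integral is now proportional to the moment generating function of a
truncated Gaussian distribution (see Appendix B), so is given by

$$e^{-sx + x^{\gamma}hs^2/2}\left[1 - \Phi\left(\frac{x^{1-\gamma/2}}{h^{1/2}} - h^{1/2}sx^{\gamma/2}\right)\right].$$

A simple bound on the error function is $\sqrt{2\pi}\,x\,(1 - \Phi(x)) < e^{-x^2/2}$ [32], so setting
$\eta = x^{1-\gamma/2}/h^{1/2} - h^{1/2}sx^{\gamma/2}$ we have

$$B_{(-\infty,0]} \le \frac{1}{\sqrt{2\pi}}\exp\left(-2sx + \frac{hs^2x^{\gamma}}{2} - \frac{1}{2}\left(\frac{x^{2-\gamma}}{h} - 2sx + hs^2x^{\gamma}\right) - \log\eta\right)$$

$$= \frac{1}{\sqrt{2\pi}}\exp\left(-sx - \frac{x^{2-\gamma}}{2h} - \log\eta\right),$$

which $\to 0$ as $x \to \infty$, so we can make this arbitrarily small.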
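The Gaussian tail bound invoked above can be checked numerically. The following is a minimal Python sketch and not part of the original proof: the function names are illustrative, and the tail probability $1 - \Phi(x)$ is computed from the standard library's `math.erfc`.

```python
import math

def gaussian_tail(x):
    """Upper tail probability 1 - Phi(x) of a standard Gaussian."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def tail_upper_bound(x):
    """The bound exp(-x^2 / 2) / (sqrt(2 pi) x), valid for x > 0."""
    return math.exp(-0.5 * x * x) / (math.sqrt(2.0 * math.pi) * x)

# Compare the exact tail with the bound at a few positive points.
for x in (0.1, 0.5, 1.0, 2.0, 5.0, 10.0):
    print(x, gaussian_tail(x), tail_upper_bound(x))
```

The bound is loose near zero and becomes tight as $x$ grows, which is the regime used in the argument.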

On $(0, x - cx^{\gamma/2}]$, note that $V(y)/V(x) - 1$ is clearly negative throughout this
region. So the integral is straightforwardly bounded as $B_{(0, x - cx^{\gamma/2}]} \le 0$ for all
$x \in X$.

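On $(x - cx^{\gamma/2}, x + cx^{\gamma/2}]$, provided $x - cx^{\gamma/2}$ is large enough that we are in
the tail regime, then for any $y$ in this region

$$\alpha(x,y) \le \exp\left(-a(y-x) + \frac{\gamma}{2}\log\frac{x}{y} - \frac{1}{2h}\left[(x-y)^2y^{-\gamma} - (x-y)^2x^{-\gamma}\right]\right).$$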


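A Taylor expansion of $y^{-\gamma}$ about $x$ gives

$$y^{-\gamma} = x^{-\gamma} - \gamma x^{-\gamma-1}(y-x) + \frac{\gamma(\gamma+1)}{2}x^{-\gamma-2}(y-x)^2 + \dots$$

and multiplying by $(y-x)^2$ gives

$$(y-x)^2y^{-\gamma} = \frac{(y-x)^2}{x^{\gamma}} - \gamma\frac{(y-x)^3}{x^{\gamma+1}} + \frac{\gamma(\gamma+1)}{2}\frac{(y-x)^4}{x^{\gamma+2}} + \dots$$

If $|y - x| = cx^{\gamma/2}$ then this is (taking $y > x$; the odd-order terms change sign for $y < x$):

$$\frac{c^2x^{\gamma}}{x^{\gamma}} - \gamma\frac{c^3x^{3\gamma/2}}{x^{\gamma+1}} + \frac{\gamma(\gamma+1)}{2}\frac{c^4x^{2\gamma}}{x^{\gamma+2}} + \dots$$

As $\gamma < 2$ then $3\gamma/2 < \gamma + 1$, and similarly for successive terms, meaning each
gets smaller as $|x| \to \infty$. So we have for large $x$ and $y \in (x - cx^{\gamma/2}, x + cx^{\gamma/2})$

$$(y-x)^2y^{-\gamma} \approx \frac{(y-x)^2}{x^{\gamma}} - \gamma\frac{(y-x)^3}{x^{\gamma+1}}. \tag{A.1}$$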


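Using (A.1) gives (for large enough $x$)

$$\alpha(x,y) \le \exp\left(-a(y-x) + \frac{\gamma}{2}\log\frac{x}{y} + \frac{\gamma}{2h}\frac{(y-x)^3}{x^{\gamma+1}}\right).$$

So we can analyse how the acceptance rate behaves. First note that for fixed
$\epsilon > 0$

$$\alpha(x, x+\epsilon) \le \exp\left(-a\epsilon + \frac{\gamma}{2}\log\frac{x}{x+\epsilon} + \frac{\gamma}{2h}\frac{\epsilon^3}{x^{\gamma+1}}\right) \to \exp(-a\epsilon).$$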


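Similarly we find that the $e^{-a\epsilon}$ term will dominate for any $\epsilon$ for which
$\epsilon^3/x^{\gamma+1} \to 0$, i.e. any $\epsilon = o(x^{(\gamma+1)/3})$. If $\gamma < 2$ then $\epsilon = cx^{\gamma/2}$ satisfies this condition. So for
any $y > x$ in this region we can choose an $x$ such that

$$\alpha(x,y) \le \exp(-a(y-x) + \delta_x),$$

where $\delta_x$ can be made arbitrarily small in this region by choosing a large enough
$x$. For the case $y < x$ here we have (for any fixed $\epsilon > 0$) that the acceptance
ratio at $y = x - \epsilon$ is bounded below by

$$\exp\left(a\epsilon + \frac{\gamma}{2}\log\frac{x}{x-\epsilon} - \frac{\gamma}{2h}\frac{\epsilon^3}{x^{\gamma+1}}\right) \to \exp(a\epsilon).$$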


So by a similar argument the acceptance ratio exceeds one here for large $x$, as
the exponential term will dominate, so that $\alpha(x,y) = 1$. Combining these results
we can write

$$B_{(x - cx^{\gamma/2},\, x + cx^{\gamma/2}]} \le \int_0^{cx^{\gamma/2}}\left[e^{(s-a)z+\delta_x} - e^{-az+\delta_x} + e^{-sz} - 1\right]q_x(dz)$$

$$= -\int_0^{cx^{\gamma/2}}\left(1 - e^{-sz}\right)\left(1 - e^{(s-a)z+\delta_x}\right)q_x(dz),$$

which will be strictly negative for large enough $x$ provided $s < a$, where $q_x(\cdot)$
denotes a zero mean Gaussian distribution with the same variance as $Q(x,\cdot)$.
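The factorisation of the integrand used in this step is elementary algebra, and can be confirmed numerically. The sketch below is illustrative only (the function names and the test values of $s$, $a$ and $\delta_x$ are assumptions, not taken from the paper). It also shows that the factored integrand is negative once $z > \delta_x/(a - s)$, which is where the requirement $s < a$ enters.

```python
import math

def integrand_expanded(z, s, a, delta):
    """e^{(s-a)z+delta} - e^{-az+delta} + e^{-sz} - 1."""
    return (math.exp((s - a) * z + delta) - math.exp(-a * z + delta)
            + math.exp(-s * z) - 1.0)

def integrand_factored(z, s, a, delta):
    """-(1 - e^{-sz}) (1 - e^{(s-a)z+delta})."""
    return -(1.0 - math.exp(-s * z)) * (1.0 - math.exp((s - a) * z + delta))

# Hypothetical values with s < a and a small delta.
s, a, delta = 0.5, 1.0, 0.01
for z in (0.1, 1.0, 5.0):
    print(z, integrand_expanded(z, s, a, delta), integrand_factored(z, s, a, delta))
```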
On $(x + cx^{\gamma/2}, x + cx^{\gamma}]$ we can upper bound the acceptance rate as

$$\alpha(x,y) \le \exp\left(\log\frac{\pi(y)}{\pi(x)} + \frac{1}{2}\log\frac{|G(y)|}{|G(x)|} + \frac{G(x)}{2h}(x-y)^2\right)$$
which can be made arbitrarily small provided $c > 2s$.

For the subexponential case, the proof is similar. Take $V(x) = e^{s|x|^{\beta}}$ and
divide $X$ up into the same regions. Outside of $(x - cx^{\gamma/2}, x + cx^{\gamma/2}]$ the same
arguments show that the integral can be made arbitrarily small. On this set,
note that in the tails

$$(x + cx^{\eta})^{\beta} - x^{\beta} = \beta cx^{\eta+\beta-1} + \frac{\beta(\beta-1)}{2}c^2x^{2\eta+\beta-2} + \dots$$

For $y - x = cx^{\eta}$, then for $\eta < 1 - \beta$ this becomes negligible, otherwise it
will grow as x does. So in this case we further divide the typical set into 
{x, X + cx^~^] U (x + cx^~^, X + cx^!'^). On {x — cx^~^, x + cx^~^) the integral 
is bounded above by e~‘^^Q{x, {x — cx^~^,x + cx^~^)) —>■ 0, for some suitably 
chosen ci > 0. On (x — x — cx^~^] U (a; + cx^~^,x + then for y > a; 

we have a{x,y) < gp ^gg ^]^g game argument as in the the 

log-concave case to show that the integral will be strictly negative in the limit. 
Appendix A.2. Proof of Lemma 3

Here a typical proposal will be $y = x \pm \xi\sqrt{h}x$ for $x$ sufficiently large, meaning
$|x - y| = \xi\sqrt{h}x$, with $\xi \sim \mathcal{N}(0,1)$. For now we assume both $x$ and $y$ are in the
tail regime, meaning $G(y) \propto y^{-2}$ and similarly for $G(x)$ (we make this concrete
later). We can also take $\pi(y)/\pi(x) = x^p/y^p$ here.

For $y = (1 + \xi\sqrt{h})x$ then in the tails the acceptance rate becomes

$$\alpha(x,y) = 1 \wedge (1 + \xi\sqrt{h})^{-(p+1)}\exp\left(\frac{\xi^3\sqrt{h}}{2}\left[\frac{2 + \xi\sqrt{h}}{(1 + \xi\sqrt{h})^2}\right]\right),$$

which is completely independent of $x$.

Take $V(x) = 1 \vee |x|^s$, for some $s < 1$ which is suitably small that $\int V(y)\pi(dy) < \infty$,
together with an extra restriction which we specify later. Then $V(y)/V(x)$
becomes independent of $x$ also. The integral of interest can now be re-written
in terms of $\xi$, with $m(\cdot)$ a standard Gaussian measure, $\phi(\xi)$ its density, and
$\alpha_h(\xi)$ the acceptance rate. So in most of the regions we consider we can choose
$x$ large enough that the integral in question is

$$\int\left[|1 + \xi\sqrt{h}|^s - 1\right]\alpha_h(\xi)\,m(d\xi). \tag{A.2}$$
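The claim that the tail acceptance rate does not depend on $x$ can be illustrated numerically. The following Python sketch assumes, purely for illustration, that $\pi(x) \propto x^{-p}$ and $G(x) = x^{-2}$ hold exactly (with proportionality constant one) and that $1 + \xi\sqrt{h} > 0$; the function names are hypothetical. It compares the acceptance ratio computed directly from $\pi$ and $G$ against a closed form in $\xi$, $h$ and $p$ alone.

```python
import math

def accept_prob_direct(x, xi, h, p):
    """alpha(x, y) at y = (1 + xi*sqrt(h)) * x, computed from pi and G directly.

    Assumes pi(x) proportional to x^{-p}, G(x) = x^{-2}, and x, y > 0.
    """
    y = (1.0 + xi * math.sqrt(h)) * x
    log_ratio = p * (math.log(x) - math.log(y))       # log pi(y)/pi(x)
    log_ratio += math.log(x) - math.log(y)            # log |G(y)|^{1/2}/|G(x)|^{1/2}
    log_ratio -= (y - x) ** 2 * (y ** -2 - x ** -2) / (2.0 * h)
    return 1.0 if log_ratio >= 0 else math.exp(log_ratio)

def accept_prob_closed_form(xi, h, p):
    """The same quantity written without any reference to x."""
    u = 1.0 + xi * math.sqrt(h)
    log_ratio = -(p + 1.0) * math.log(u)
    log_ratio += 0.5 * xi ** 3 * math.sqrt(h) * (2.0 + xi * math.sqrt(h)) / u ** 2
    return 1.0 if log_ratio >= 0 else math.exp(log_ratio)

# The direct value should agree with the closed form for every x.
for x in (10.0, 1e3, 1e6):
    print(x, accept_prob_direct(x, 0.7, 0.1, 2.0), accept_prob_closed_form(0.7, 0.1, 2.0))
```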


We therefore need to show that this integral is strictly negative for h small 
enough, and take care of the values of y which may not fall into this region. 

Again denoting this integral over a region $A$ as $B_A$, we can break (A.2) up
into

$$B_{(-\infty,\infty)} = B_{(-\infty,\,-2h^{-1/2})} + B_{[-2h^{-1/2},\,-\delta h^{-1/4})} + B_{[-\delta h^{-1/4},\,\delta h^{-1/4}]} + B_{(\delta h^{-1/4},\,\infty)}$$

$$= B_{H_1} + B_{H_2} + B_{H_3} + B_{H_4}.$$

It is clear that all of these integrals can be made arbitrarily close to zero by
making $h$ small enough. The goal is to show that $B_{(-\infty,\infty)} < 0$ for all $h \in (0, h_0)$.
We proceed by finding the order of $h$ of each $B_{H_i}$.

On $H_1 = (-\infty, -2h^{-1/2})$ we have

$$B_{H_1} \le \int_{-\infty}^{-2h^{-1/2}}\left[|1 + \xi\sqrt{h}|^s - 1\right]\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{\xi^2}{2}\right)d\xi.$$

Using the change of variables $\gamma = 1 + \xi\sqrt{h}$ gives

$$\int_{-\infty}^{-1}\left[|\gamma|^s - 1\right]m(d\gamma) = \int_1^{\infty}(\eta^s - 1)\,m(d\eta) \le \int_1^{\infty}\eta\,m(d\eta),$$

with $\eta \sim \mathcal{N}(-1, h)$, as $s < 1$. Using results for truncated Gaussians, we have


rim{dri) = —<i> 




$ 


2 

y/h 


1 - $ 
The lower bound on the Gaussian tail probability from [35] gives


Bh, < 


2 + h I h 



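On $H_2 = [-2h^{-1/2}, -\delta h^{-1/4})$ the function $|1 + \xi\sqrt{h}|^s - 1$ is negative,
so this integral is trivially bounded as $B_{H_2} \le 0$ for any $h$. Note that this is the
entire set of $y$'s for which (A.2) is not the correct integral.

On $H_3 = [-\delta h^{-1/4}, \delta h^{-1/4}]$ recall that the acceptance probability is

$$\alpha_h(\xi) = 1 \wedge \exp\left(-(p+1)\log(1 + \xi\sqrt{h}) + \frac{\xi^3\sqrt{h}}{2}\cdot\frac{2 + \xi\sqrt{h}}{(1 + \xi\sqrt{h})^2}\right).$$

For any $\xi > 0$ we have

$$\frac{2 + \xi\sqrt{h}}{(1 + \xi\sqrt{h})^2} \le \frac{2(1 + \xi\sqrt{h})}{(1 + \xi\sqrt{h})^2} \le 2,$$

meaning

$$\alpha_h(\xi) \le \exp\left(-(p+1)\log(1 + \xi\sqrt{h}) + \xi^3\sqrt{h}\right).$$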

We would like to write this as $(1 + \xi\sqrt{h})^{-a}$ for some $a > 0$. If $\delta h^{1/4} < 1$ we
can use a Taylor expansion with remainder, $\log(1 + x) = x - x^2/2 + x^3/(3(1+r)^3)$ for
some $r \in (0, x)$, to get the bound $x - x^2/2 \le \log(1 + x)$ for $0 < x < 1$. For any
$b < p + 1$ then

blog{l+^Vh) > b (^^y/h- for ^ G {0,Sh~i), S < 


2 + ^y/h 

(l + eVh)2 


<eh, 



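So provided $\delta$ is chosen in this way then there exists $a > 0$ such that $\alpha_h(\xi) \le (1 + \xi\sqrt{h})^{-a}$
for $\xi \in (0, \delta h^{-1/4})$ and $\alpha_h(\xi) = 1$ for $\xi \in (-\delta h^{-1/4}, 0)$ (by simply reversing the signs
in the above inequalities). Now the integral of interest can be written

$$B_{H_3} \le \int_0^{\delta h^{-1/4}}\left[(1 + \xi\sqrt{h})^{s-a} - (1 + \xi\sqrt{h})^{-a} + (1 - \xi\sqrt{h})^s - 1\right]m(d\xi).$$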


So we need to bound 


J {l + ^y/Jiy — J (l + ^y/h) “m(d^) + J {l — ^y/hym{d^) — ^^{Sh 


Upper and lower bounds for $g(\xi) = (1 + \xi\sqrt{h})^{-a}$ on $(0, \delta h^{-1/4})$ are

$$g_u(\xi) = m_u(a)\xi + 1, \qquad m_u(a) = \frac{h^{1/4}}{\delta}\left[(1 + \delta h^{1/4})^{-a} - 1\right],$$

$$g_l(\xi) = m_l(a)\xi + 1, \qquad m_l(a) = -a\sqrt{h}.$$

The first is a straight line through $g(\delta h^{-1/4})$ and $g(0) = 1$, the second is the
straight line through $g(0) = 1$ with gradient $g'(0)$ (as the function is convex).
This gives upper and lower bounds for the first two integrals as

mu(a - s)'l'h + ^(Sh~^) - and ■mi(a)'l'h + ^(Sh^) - 

where = (j>{ 6 h~i) — Ij^f^ < 0 . We can construct a similar Taylor Series 
upper bound for (1 — as a straight line with gradient to* = —sVh (as 

this function is concave), meaning the total bound of interest is 


Bh 3 < {rriuia - s) - mi{a) + to*)^'/i, 

where $C_{H_3} = (a-s)\sqrt{h} + \frac{h^{1/4}}{\delta}\left((1 + \delta h^{1/4})^{s-a} - 1\right)$. It is clear that $C_{H_3}$ is positive,
as

$$1 + (a-s)\delta h^{1/4} > 1 > (1 + \delta h^{1/4})^{s-a}$$

because $(a - s) > 0$.

On $H_4 = (\delta h^{-1/4}, \infty)$, bounding in the same way as for $H_1$, we set $\gamma =
1 + \xi\sqrt{h}$, meaning $\gamma \sim \mathcal{N}(1, h)$. Then


Bh3 < / _1 [Sr - 1] w(d7), 

J Sh 4 


which can be re-written 

[ 17 ^ - 1] <i>^( 5 h-J) < E^ [7] $=(<i/i-J), 

= (1 -b (5hi)$“((ih-J) -b Vh(t>iSh-i), 


where $w$ is now a truncated Gaussian distribution on $(1 + \delta h^{1/4}, \infty)$ with mean
1 and variance $h$. Using the upper bound on the Gaussian tail probability gives


Bh4 < (1 + Sh*) 


'hi 


1 h* 
S 


exp - 


= \/ ^ ( + 7 ) exp ( - 

= CHi exp ( - 


2 i/hJ 

2V hj 


6 ^ \ 




2VhJ 


Combining inequalities, we can get a very loose upper bound on the integral 


as 


B(^—00^00) — “t“ Ch^) exp ^ ‘ 2 \SJj^ 


Chi exp ( - Ch3- 
The exponentials are the dominant terms in the first two expressions, as they
shrink to zero much faster than any of the $C_{H_i}$ terms (which still depend on $h$).
To see that this is the case for $C_{H_3}$, note that $(1 + \delta h^{1/4})^{s-a} = 1 + (s-a)\delta h^{1/4} +
O(h^{1/2})$, so that

$$C_{H_3} = (a-s)\sqrt{h} + \frac{h^{1/4}}{\delta}\left((1 + \delta h^{1/4})^{s-a} - 1\right)$$

$$= (a-s)\sqrt{h} + \frac{h^{1/4}}{\delta}\left(-(a-s)\delta h^{1/4} + O(h^{1/2})\right)$$

$$= O(h^{3/4}).$$
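The order of $C_{H_3}$ in $h$ can be checked numerically. The sketch below is an illustration rather than part of the proof: the function names and parameter values are assumptions, and the leading coefficient $\delta(a-s)(a-s+1)/2$ comes from carrying the binomial expansion one term further than the text does.

```python
import math

def c_h3(h, a, s, delta):
    """C_{H_3} = (a - s) sqrt(h) + (h^{1/4} / delta) ((1 + delta h^{1/4})^{s-a} - 1)."""
    q = h ** 0.25
    return (a - s) * math.sqrt(h) + (q / delta) * ((1.0 + delta * q) ** (s - a) - 1.0)

def leading_coefficient(a, s, delta):
    """Coefficient of h^{3/4} from the second-order binomial term (an assumption)."""
    k = a - s
    return 0.5 * delta * k * (k + 1.0)

# Hypothetical parameter values with a > s.
a, s, delta = 1.5, 0.5, 0.5
for h in (1e-2, 1e-4, 1e-6, 1e-8):
    print(h, c_h3(h, a, s, delta) / h ** 0.75, leading_coefficient(a, s, delta))
```

The printed ratio settling near a positive constant as $h$ shrinks is consistent with $C_{H_3}$ being positive and of order $h^{3/4}$.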

It is more straightforward to see that Chi and Ch^ are both 0{h^). Because 
of this, we can always choose a h small enough that the last term is arbitrarily 
larger than all others in the expression, meaning that the integral is strictly 
negative, as required. 


Appendix A.3. Proof of Lemma 4
The goal is to show

$$\limsup_{|x| \to \infty}\int\alpha(x,y)Q(x,dy) = 0.$$

The general strategy will be to find some set

$$A_{x,\epsilon} := \{y \in X : \alpha(x,y) > \epsilon\},$$

in words, the set of candidate moves which have a non-negligible
probability of acceptance. We will then establish that $Q(x, A_{x,\epsilon}) \to 0$
as $x \to \infty$, for any $\epsilon > 0$.

First recall that for the algorithm in general the acceptance probability for
a proposal $y$ is

$$\alpha(x,y) = 1 \wedge \frac{\pi(y)|G(y)|^{1/2}}{\pi(x)|G(x)|^{1/2}}\exp\left(-\frac{1}{2h}(y-x)^2[G(y) - G(x)]\right).$$
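For concreteness, the acceptance probability just recalled can be turned into a one-dimensional sampler step. This is a minimal sketch under stated assumptions, not the author's implementation: the function names, the use of Python's `random` module, and the example choices of `log_pi` and `G` in the usage lines are all hypothetical. The proposal is $y \sim \mathcal{N}(x, h/G(x))$, so the ratio of proposal densities contributes the $|G|^{1/2}$ and exponential factors.

```python
import math
import random

def pdrwm_accept_prob(x, y, log_pi, G, h):
    """Acceptance probability for a proposal y ~ N(x, h / G(x)) in one dimension."""
    # log of pi(y) |G(y)|^{1/2} / (pi(x) |G(x)|^{1/2})
    log_ratio = log_pi(y) - log_pi(x) + 0.5 * (math.log(G(y)) - math.log(G(x)))
    # log of exp(-(y - x)^2 [G(y) - G(x)] / (2h))
    log_ratio -= (y - x) ** 2 * (G(y) - G(x)) / (2.0 * h)
    return 1.0 if log_ratio >= 0 else math.exp(log_ratio)

def pdrwm_step(x, log_pi, G, h, rng=random):
    """One Metropolis-Hastings step; returns the next state of the chain."""
    y = rng.gauss(x, math.sqrt(h / G(x)))
    return y if rng.random() < pdrwm_accept_prob(x, y, log_pi, G, h) else x

# Usage with illustrative choices: a standard Gaussian target and G(x) = 1 / (1 + x^2).
rng = random.Random(0)
x = 0.0
for _ in range(1000):
    x = pdrwm_step(x, lambda u: -0.5 * u * u, lambda u: 1.0 / (1.0 + u * u), 1.0, rng)
```

When $G$ is constant the extra factors cancel and the rule reduces to the usual Random Walk Metropolis acceptance probability $1 \wedge \pi(y)/\pi(x)$.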


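If $G(x) = O(|x|^{-\gamma})$ then for large enough $x$ and $y$ the acceptance probability is

$$\alpha(x,y) = 1 \wedge \frac{\pi(y)}{\pi(x)}\left(\frac{|x|}{|y|}\right)^{\gamma/2}\exp\left(-\frac{c(x-y)^2}{2h}\left[\frac{1}{|y|^{\gamma}} - \frac{1}{|x|^{\gamma}}\right]\right).$$

As each $Q(x,\cdot)$ is a Gaussian distribution, we consider a 'typical set' to be

$$T_x = \left(x - 2\sqrt{h}x^{\gamma/2},\; x + 2\sqrt{h}x^{\gamma/2}\right).$$

For any $x$, $Q(x, T_x) \approx 0.96$. If we can show that i) for large enough $x$, $A_{x,\epsilon} \subset T_x$,
and ii) the ratio $Q(x, A_{x,\epsilon})/Q(x, T_x) \to 0$ then we will have established the
result.

First we note that for $|y|$ larger than $x$ then if $\pi(x) \in C_0(\mathbb{R})$ then in the tails
$\pi(y)/\pi(x) \le 1$, so we can say

$$\alpha(x,y) \le \left(\frac{|x|}{|y|}\right)^{\gamma/2}\exp\left(-\frac{c(x-y)^2}{2h}\left[\frac{1}{|y|^{\gamma}} - \frac{1}{|x|^{\gamma}}\right]\right).$$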


Since if $y = x$ then $\alpha(x,y) = 1$, we will only concern ourselves with $|y| > |x|$.
In effect we are now considering the set $A_{x,\epsilon} \cup (-x, x)$, but since this is strictly
larger than $A_{x,\epsilon}$ it will give us the result. For $y > x$, if we write $y = x + z$ for
some $z > 0$ (and do similar in the other tail), we can see that

$$\alpha(x, x+z) \le \left(\frac{x}{x+z}\right)^{\gamma/2}\exp\left(-\frac{cz^2}{2h(x+z)^{\gamma}} + \frac{cz^2}{2hx^{\gamma}}\right).$$

As $x \to \infty$, the first term on the right-hand side will tend to something greater
than zero for $z = O(x)$ and decay to zero for the set of $z$'s that grow at a larger
rate than $x$. Inside the exponential, the term $cz^2/2h(x+z)^{\gamma} \to 0$ for any $z$ as
$x$ grows. The last term $cz^2/2hx^{\gamma}$ will only increase with $x$ for the set of $z$'s that
grow at a faster rate than $x^{\gamma/2}$. If we denote this set of 'extreme' values for $y$
which would be accepted as $E_{x,\epsilon}$ then it is clear that $Q(x, E_{x,\epsilon}) \to 0$
for any $\epsilon > 0$, as $E_{x,\epsilon} \subset (-\infty, -x^{\gamma/2+\delta}) \cup (x^{\gamma/2+\delta}, \infty)$ for some $\delta > 0$, and this
set will be sent deeper and deeper into the tails of $Q(x,\cdot)$ as $|x|$ grows.

So now we can focus on $A_{x,\epsilon} \cap T_x$, or equivalently consider the set of possible $z$
values in $(-2x^{\gamma/2}, 0) \cup (0, 2x^{\gamma/2})$. For any of these the dominant term in
$\alpha(x, x+z)$ will be $(x/(x+z))^{\gamma/2}$, so the acceptance rate will be strictly decreasing in $z$
on this set. Hence we need only examine the boundary points, $y = x + 2\sqrt{h}x^{\gamma/2}$
and $y = x - 2\sqrt{h}x^{\gamma/2}$, and show that these both decay to zero as $x \to \infty$.

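For $y = x + 2\sqrt{h}x^{\gamma/2}$ the acceptance rate becomes

$$\alpha(x,y) \le \left(\frac{x}{x + 2\sqrt{h}x^{\gamma/2}}\right)^{\gamma/2}\exp\left(-\frac{4chx^{\gamma}}{2h}\left[\frac{1}{(x + 2\sqrt{h}x^{\gamma/2})^{\gamma}} - \frac{1}{x^{\gamma}}\right]\right)$$

$$\le \left(\frac{x}{x + 2\sqrt{h}x^{\gamma/2}}\right)^{\gamma/2}\exp(2c) \to 0.$$

And for $y = x - 2\sqrt{h}x^{\gamma/2}$, noting that for large $x$, $|x - 2\sqrt{h}x^{\gamma/2}| > x$, the
same argument gives

$$\alpha(x,y) \le \left(\frac{x}{|x - 2\sqrt{h}x^{\gamma/2}|}\right)^{\gamma/2}\exp(2c) \to 0.$$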
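The decay of the acceptance bound at the boundary of the typical set can be illustrated numerically. The sketch below evaluates the upper bound on $\alpha(x, x+z)$ at $z = 2\sqrt{h}x^{\gamma/2}$ for a growing sequence of $x$. The function name and the parameter values ($\gamma = 3$, $h = c = 1$) are assumptions chosen only so that $\gamma > 2$.

```python
import math

def boundary_accept_bound(x, gamma, h, c):
    """Upper bound on alpha(x, x + z) at z = 2 sqrt(h) x^{gamma/2}, for x > 0."""
    z = 2.0 * math.sqrt(h) * x ** (gamma / 2.0)
    prefactor = (x / (x + z)) ** (gamma / 2.0)
    exponent = (-c * z ** 2 / (2.0 * h * (x + z) ** gamma)
                + c * z ** 2 / (2.0 * h * x ** gamma))
    return prefactor * math.exp(exponent)

# With gamma > 2 the bound shrinks as x moves into the tails.
for x in (1e2, 1e4, 1e6):
    print(x, boundary_accept_bound(x, 3.0, 1.0, 1.0))
```

The exponential factor stays below $e^{2c}$ throughout, so the decay is driven entirely by the prefactor $(x/(x+z))^{\gamma/2}$.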
Appendix B. Needed facts about truncated Gaussian distributions


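Here we collect some elementary facts used in the article. If $X$ follows a
Gaussian distribution $\mathcal{N}(\mu, \sigma^2)$ truncated to $[a, b]$ then
it has density

$$f(x) = \frac{1}{\sigma Z_{a,b}}\phi\left(\frac{x - \mu}{\sigma}\right)\mathbb{1}_{[a,b]}(x),$$

where $\phi(x) = e^{-x^2/2}/\sqrt{2\pi}$, $\Phi(x) = \int_{-\infty}^{x}\phi(y)dy$ and $Z_{a,b} = \Phi((b - \mu)/\sigma) -
\Phi((a - \mu)/\sigma)$. Defining $B = (b - \mu)/\sigma$ and $A = (a - \mu)/\sigma$, we have

$$E[X] = \mu + \sigma\frac{\phi(A) - \phi(B)}{Z_{a,b}}$$

and

$$E[e^{tX}] = e^{\mu t + \sigma^2t^2/2}\,\frac{\Phi(B - \sigma t) - \Phi(A - \sigma t)}{Z_{a,b}}.$$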
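As a sanity check on the mean and moment generating function of a truncated Gaussian, the closed forms can be compared with a direct numerical integration of the density. This is an illustrative Python sketch: the function names, the midpoint quadrature rule and the parameter values are assumptions, not taken from the paper.

```python
import math

def phi(x):
    """Standard Gaussian density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x):
    """Standard Gaussian distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def truncated_mean(mu, sigma, a, b):
    A, B = (a - mu) / sigma, (b - mu) / sigma
    return mu + sigma * (phi(A) - phi(B)) / (Phi(B) - Phi(A))

def truncated_mgf(t, mu, sigma, a, b):
    A, B = (a - mu) / sigma, (b - mu) / sigma
    scale = math.exp(mu * t + 0.5 * sigma ** 2 * t ** 2)
    return scale * (Phi(B - sigma * t) - Phi(A - sigma * t)) / (Phi(B) - Phi(A))

def numeric_expectation(g, mu, sigma, a, b, n=20000):
    """Midpoint-rule approximation of E[g(X)] under the truncated density."""
    Z = Phi((b - mu) / sigma) - Phi((a - mu) / sigma)
    width = (b - a) / n
    total = 0.0
    for i in range(n):
        x = a + (i + 0.5) * width
        total += g(x) * phi((x - mu) / sigma) / (sigma * Z) * width
    return total

mu, sigma, a, b, t = 0.3, 1.2, -1.0, 2.0, 0.7
print(truncated_mean(mu, sigma, a, b), numeric_expectation(lambda x: x, mu, sigma, a, b))
print(truncated_mgf(t, mu, sigma, a, b),
      numeric_expectation(lambda x: math.exp(t * x), mu, sigma, a, b))
```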

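In the special case $b = \infty$, $a = 0$ this becomes

$$E[e^{tX}] = e^{\mu t + \sigma^2t^2/2}\,\frac{1 - \Phi(-\mu/\sigma - \sigma t)}{1 - \Phi(-\mu/\sigma)}.$$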

Appendix C. Rectangle contour plots 

The contour plots show the region of proposals which would be accepted if 
the current point is given by the green dot. The area in the lower half of the 
ellipse which is coloured yellow is larger than that in the upper half (shown in 
red), implying that on average the vertical coordinate (and hence V{x)) will be 
smaller for the next point in the chain. 